Categorizing Unsupervised Relational Learning Algorithms
Abstract
We outline some criteria by which to compare unsupervised relational learning algorithms, and illustrate these criteria with reference to three examples: SUBDUE, relational association rules (WARMR), and Probabilistic Relational Models. For each algorithm we ask: What form of input data does it require? What form of output does it produce? Can the output be used to make predictions about unseen inputs? Categorizing the existing unsupervised relational learning algorithms helps us to understand how each algorithm relates to the others (no pun intended). We can identify important gaps in coverage that could be fruitful areas for future research.

1 What do we mean by unsupervised?

In this paper we outline some criteria by which to compare unsupervised relational learning algorithms. We begin by clarifying what we mean by an unsupervised learning algorithm. A supervised learning algorithm distinguishes one attribute of its input instances as the target and learns a model designed to predict the value of the target attribute for previously unseen inputs. The target attribute can be discrete, as in classification, or continuous. An unsupervised learning algorithm does not treat any particular attribute of its input instances as the target to be learned. There is no teacher who gives the correct answer; there is no one correct answer. In some cases, the model produced by an unsupervised learning algorithm can be used for prediction tasks even though it was not designed for such tasks.

The distinction between supervised and unsupervised learning is a spectrum on which some algorithms sit at the extremes and others toward the middle. SUBDUE is clearly an unsupervised learning algorithm. It recognizes repeated substructures in a labeled graph, and can be used for graph compression [Cook and Holder, 1994] and for hierarchical clustering [Jonyer et al., 2001], but not prediction. Relational Markov Networks [Taskar et al., 2002] are designed for discriminative training: they fall at the supervised end of the spectrum. Probabilistic Relational Models (PRMs) lie toward the middle. PRMs learn a dependency structure which can enhance a domain expert's understanding of the data [Getoor et al., 2001]. They can model uncertainty in the relational structure of the domain [Getoor et al., 2002]. They can be used for classification and for clustering [Taskar et al., 2001]. The underlying learning algorithm is the same, but the relational data structures given as input are adapted to the desired task.

∗ This effort is supported by DARPA and AFRL under contract numbers F30602-00-2-0597 and F30602-01-2-0566, and by NSF under contract number EIA-9983215. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, AFRL, NSF, or the U.S. Government.

2 Criteria of comparison

Unsupervised relational learning algorithms can be categorized along several different axes:

• What form of input data does the algorithm require?
• What form of output does it produce?
• Can the output be used to make predictions about unseen inputs?

To describe the input data configuration, we employ the terms object, link, and attribute. (We choose link instead of relation to avoid confusion with the terminology of relational database management systems.) In our framework, relational data consist of objects connected together by links. Both objects and links can have attributes. An attribute is a name-value pair. The input for any learning algorithm that claims to be "relational" must have links as well as objects.
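The object/link/attribute vocabulary can be made concrete with a minimal sketch. This is not the paper's formalism, only an illustrative encoding: the class names `Obj` and `Link` and the toy "person/knows" data are our own choices.

```python
from dataclasses import dataclass, field

# Illustrative encoding of the object/link/attribute vocabulary:
# objects connected by links, where both objects and links carry
# name-value attribute pairs.

@dataclass
class Obj:
    oid: str
    attrs: dict = field(default_factory=dict)   # name -> value pairs

@dataclass
class Link:
    src: str                                    # oid of source object
    dst: str                                    # oid of destination object
    attrs: dict = field(default_factory=dict)   # links may also have attributes

# A toy relational database: two person objects joined by a "knows" link.
objects = {
    "p1": Obj("p1", {"type": "person", "age": 34}),
    "p2": Obj("p2", {"type": "person", "age": 29}),
}
links = [Link("p1", "p2", {"type": "knows", "since": 1999})]
```

Under this encoding, the paper's input criterion becomes a set of simple questions about the data: are `attrs` dictionaries empty, single-entry, or unrestricted, and does the link list connect everything into one component or several?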
A link can be represented explicitly by an edge in a graph, or implicitly by a pointer to the related object. The number of attributes allowed for each object or link can be none, exactly one, or many. The input database can consist of a single connected component, or a set of connected components.

The output of a relational learning algorithm is a pattern (using the term loosely) that expresses a generalization supported by the input data. The scale of the pattern might be a single object, or a structure consisting of a group of related objects and the links that connect them. All patterns produced by a relational learning algorithm are descriptive because they capture regularities of the input data; some patterns can also be used to make predictions about unseen data.

Categorizing the existing unsupervised relational learning algorithms helps us to understand how each algorithm relates to the others (no pun intended). Our goals in developing this categorization are

• to establish a common vocabulary in which to express the similarities and differences of relational learning algorithms;
• to identify interesting areas of unsupervised relational learning that are currently underdeveloped.

3 Three example algorithms

We illustrate our multi-dimensional categorization of unsupervised relational learning algorithms by comparing three systems that differ widely in their input and output formats.

The WARMR algorithm [Dehaspe et al., 1998; Dehaspe and Toivonen, 2001] finds relational association rules or, to use the vocabulary of the authors, query extensions. The algorithm takes as input a Prolog database and a specification (in the WARMODE language) that limits the format of possible query extensions. The output of WARMR is a set of query extensions, all of which refer to the object designated as the key parameter. The query extensions are not limited to attributes of the key object, but can include its links to other objects and their attributes.
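The key-parameter idea behind WARMR can be conveyed with a heavily simplified sketch. This is not the WARMR algorithm itself (which performs a full level-wise search under WARMODE constraints); it only shows the central counting step, where a pattern's frequency is the number of key objects whose facts support it. The fact triples and function name below are our own invention.

```python
from collections import Counter
from itertools import combinations

# Toy Prolog-style facts as (predicate, key, value) triples.
# Each key (c1, c2, c3) plays the role of WARMR's key object.
facts = [
    ("buys", "c1", "wine"), ("buys", "c1", "cheese"),
    ("buys", "c2", "wine"), ("buys", "c2", "bread"),
    ("buys", "c3", "wine"), ("buys", "c3", "cheese"),
]

def frequent_pairs(facts, min_support):
    """Count how many key objects support each pair of (pred, value) atoms,
    keeping only pairs at or above the support threshold."""
    by_key = {}
    for pred, key, val in facts:
        by_key.setdefault(key, set()).add((pred, val))
    counts = Counter()
    for atoms in by_key.values():
        for pair in combinations(sorted(atoms), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

print(frequent_pairs(facts, 2))
# {(('buys', 'cheese'), ('buys', 'wine')): 2}  -- supported by c1 and c3
```

In WARMR proper, the candidate conjunctions may also reach through links to other objects and their attributes; this sketch only counts atoms attached directly to the key.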
The SUBDUE system [Cook and Holder, 1994] iteratively discovers repeated substructures in a graph and compresses the graph by replacing each repeated substructure with a single vertex. The algorithm takes as input a labeled graph and a set of rules intended to bias the search process toward structures that are deemed more interesting. SUBDUE returns as output the substructure selected at each iteration as the best to compress the graph.

Probabilistic Relational Models (PRMs) reinterpret Bayesian networks in a relational setting. PRMs have been evolving rapidly over the past few years; we focus here on the version described in [Getoor et al., 2002]. A PRM captures the probabilistic dependence between the attributes of interrelated objects. It can also model uncertainty about the link structure. Reference uncertainty means we know how many links there are in the graph, but we don't know what their endpoints are. Existence uncertainty means we don't know how many links there are and have to consider the possibility that any pair of objects (of the appropriate types) might be linked. The input to the PRM learning algorithm is a database schema (specifying objects, links, and attributes) and an instantiation of that schema (a set of relational tables).

4 Input criterion of comparison

The first criterion of comparison concerns the input to the unsupervised relational learning algorithm. Our three example algorithms have very different data representations, but conceptually we can view their input in terms of objects and links. For SUBDUE the mapping is straightforward: objects correspond to vertices in the graph, and links to edges. SUBDUE requires exactly one attribute on each object and link in the graph: a label. In the Inductive Logic Programming approach of WARMR, the input data are a set of Prolog facts, describing both objects and links. The predicate name is the equivalent of a type attribute. For example (from [Dehaspe and Toivonen, 2001, p. 191]), a fact such as
Publication date: 2003